Language Identification on the Web: Extending the Dictionary Method

Authors

  • Radim Rehurek
  • Milan Kolkus
Abstract

Automated language identification of written text is a well-established research domain that has received considerable attention. Efficient and effective algorithms based on character n-grams are now in use, mainly with identification based on Markov models or on character n-gram profiles. In this paper we investigate the limitations of these approaches when applied to real-world web pages. The challenges include identifying the language of very short texts and correctly handling texts of unknown language and texts composed of multiple languages. We propose and evaluate a new method that constructs language models based on word relevance and addresses these limitations. We also extend our method to efficiently and automatically segment the input text into blocks of individual languages in the case of multiple-language documents.
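
As a rough illustration of the general dictionary idea named in the title, the following minimal sketch counts how many tokens of a text appear in per-language word lists and returns "unknown" when too few tokens match. It is not the authors' method: the word lists, the `identify` function, and the `min_share` threshold are all hypothetical and chosen only for the example, and the paper's word-relevance weighting and segmentation step are not reproduced here.

```python
from collections import Counter

# Toy word lists invented for this illustration; not taken from the paper.
WORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "und", "ist", "das", "nicht", "ein"},
    "cs": {"je", "se", "na", "že", "to", "není"},
}


def identify(text, min_share=0.2):
    """Return the language whose word list matches the most tokens.

    Returns "unknown" when the text is empty, when no token matches,
    or when the best language covers less than `min_share` of the tokens
    (an assumed, arbitrary threshold).
    """
    tokens = text.lower().split()
    if not tokens:
        return "unknown"

    hits = Counter()
    for tok in tokens:
        for lang, vocab in WORDS.items():
            if tok in vocab:
                hits[lang] += 1

    if not hits:
        return "unknown"

    lang, count = hits.most_common(1)[0]
    if count / len(tokens) < min_share:
        return "unknown"
    return lang


if __name__ == "__main__":
    print(identify("the cat is in the house"))
    print(identify("der Hund ist nicht ein Vogel"))
    print(identify("xyzzy plugh"))
```

The explicit "unknown" result reflects one of the challenges the abstract lists: a classifier that always picks the closest known language cannot flag text in a language it has no model for.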

Similar resources

Dictionary of Abstract and Concrete Words of the Russian Language: A Methodology for Creation and Application

The paper describes the first stage of a project on creating an electronic dictionary with numerical estimates of the degree of abstractness and concreteness of Russian words. Our approach is to integrate data obtained from several different sources: text corpora, psycholinguistic experiments, published dictionaries, markers of abstractness (certain suffixes) and a translation of a similar dict...

A Supervised Method for Constructing Sentiment Lexicon in Persian Language

Due to the growth of digital content on the internet and social media, sentiment analysis is one of the emerging fields. This problem, which deals with information extraction and knowledge discovery from textual data using natural language processing, has attracted the attention of many researchers. Construction of a sentiment lexicon as a valuable language resource is one of the imp...

EFL Translation Students' Perspective toward Using Bilingual Dictionary in Translation of Polysemous Words

This research presented the use of bilingual dictionary and addressed the EFL translation students' points of view on the use of bilingual dictionary in translating polysemous words (English to Persian). Moreover, it aimed at finding the possible relationship between the effect of using bilingual dictionary by students in translating polysemous words and their achieved scores. In the study ...

Arabic Script Web Document Language Identifications Using Neural Network

This paper presents experiments in identifying the language of Arabic-script web documents using a neural network. There are some difficulties when identifying languages written in Arabic script, such as Persian, Turkish, Urdu, Jawi etc. Since there is a vast amount of information presented to the internet users, it is crucial to find an appropriate method in language identification for a variety of tex...

The effect of three vocabulary techniques on the Iranian ESP learners’ vocabulary production

The present study aimed to examine the effect of three vocabulary techniques (dictionary use, etymological analysis, and glossing) on the Iranian ESP learners' vocabulary production. Forty-five university students majoring in architecture at Azad University, Anzali branch, participated in this study. They were divided into three groups, and each group was randomly assigned to one kind of treat...

Publication date: 2009